Different structural variant prediction tools yield considerably different results in Caenorhabditis elegans

The accurate characterization of structural variation is crucial for our understanding of how large chromosomal alterations affect phenotypic differences and contribute to genome evolution. Whole-genome sequencing is a popular approach for identifying structural variants, but the accuracy of widely used tools remains unclear due to the limitations of existing benchmarks. Moreover, the performance of these tools for predicting variants in non-human genomes is less certain, as most tools were developed and benchmarked using data from the human genome. To evaluate the use of long-read data for the validation of short-read structural variant calls, the agreement between predictions from a short-read ensemble learning method and long-read tools was compared using real and simulated data from Caenorhabditis elegans. The results obtained from simulated data indicate that the best-performing tool is contingent on the type and size of the variant, as well as the sequencing depth of coverage. These results also highlight the need for reference datasets generated from real data that can be used as ‘ground truth’ in benchmarks.


Introduction
Large alterations in chromosome structure contribute substantially to the genetic diversity observed in natural populations and play a fundamental role in the evolution of novel genes [1]. Changes that span large segments of the genome (e.g., >100 bp) are termed structural variants (SVs) and include deletions, tandem and interspersed duplications, insertions, and inversions. SVs may be neutral, deleterious, or adaptive [2], and are known to facilitate speciation [3]. SVs drive genome evolution through several mechanisms. For example, large heterozygous inversions can suppress recombination, thereby protecting locally adapted alleles [4]. Copy number variation (CNV), the gain or loss of genomic segments, is another important factor in genome evolution. CNVs are associated with a wide range of phenotypic effects due to the modulation of gene expression, including differential drug responses between individuals [5], HIV susceptibility [6], autism spectrum disorders [7], and schizophrenia [8]. Gene duplication and subsequent diversification is a source of novel genes and functional innovation.

Structural variation prediction using simulated short-read data
To evaluate the performance of various short-read structural variant calling methods, SVsim was used to create mock genomes containing simulated deletions, duplications, and inversions. Two datasets were created from the C. elegans reference genome: a training dataset used to train the FusorSV fusion model, and a testing dataset used to benchmark the performance of each caller. Each dataset contained simulated SVs that ranged from 100 bp to 280 kbp. Because FusorSV uses a variable number of bins to discriminate SVs by size and type during the training phase, 129 mock genomes were created for each dataset so that each bin contained a total of 30 SVs, while ensuring that the total number of base pairs spanned by SVs in each mock genome did not exceed 1% of the C. elegans reference genome. The mock genomes were used to generate simulated Illumina reads at 5X, 15X, 30X, and 60X sequencing depth of coverage. Caller performance varied by variant type and depth (Table 1). For deletion calls, DELLY had the highest F-measure scores at all sequencing depths, followed closely by BreakDancer. Sequencing depth had a stronger impact on the other variant callers, except for Hydra. The accuracies of cnMOPS and CNVnator both improved considerably above 5X, while increased false positives accounted for the decreased accuracy of Lumpy at 60X. The performance of FusorSV was similar to that of the best-performing tools for each variant type at all depths.
No single metric completely describes the performance of each variant caller. Because precision, recall, and F1 scores are calculated from counts of true positives, false positives, and false negatives, they fail to describe the performance of each caller at the base-pair level. The Jaccard similarity score was therefore used to describe the base-pair overlap between the predicted and simulated variants. Because the Jaccard score is based on the proportion of intersecting base pairs, it is biased towards the performance for larger SVs, especially when calculated over the entire range of SV sizes. The Jaccard metric was also used to quantify the performance of each caller at 60X sequencing depth within several SV size ranges. Although partitioning by size reduces this bias compared to a single overall score, the metric remains biased towards the larger SV sizes contained in each range.
Although DELLY had the highest accuracy at all depths, its Jaccard value decreased from 0.99 at 30X depth to 0.74 at 60X depth (Fig 1). This decrease was caused by a single 2.56 Mbp false positive call at 60X depth. Differences between the Jaccard and F1 scores were observed for several callers. At 5X depth, the CNVnator F1 and Jaccard scores were 0.53 and 0.91, respectively (S1 Table). Because CNVnator performed well at predicting larger variants, a higher Jaccard score was obtained despite low accuracy for smaller deletion sizes. Conversely, the Hydra F1 scores ranged between 0.72 and 0.79, while its Jaccard scores ranged between 0.12 and 0.13. These large discrepancies between the F1 and Jaccard scores resulted from good performance for smaller deletions but poor performance for larger sizes.
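The size of the drop in the DELLY Jaccard score is consistent with simple base-pair arithmetic. The following sketch is purely illustrative; the recovered deletion span used below is inferred from the reported scores and is not stated in the text:

    # Illustrative arithmetic with an assumed figure: if roughly 7.3 Mbp
    # of simulated deletions were recovered nearly perfectly, one
    # 2.56 Mbp false positive added to the union is enough to drop the
    # Jaccard score from ~0.99 to ~0.74.
    true_bp = 7.3e6             # assumed recovered deletion span (inferred)
    false_positive_bp = 2.56e6  # the single false call at 60X
    print(round(true_bp / (true_bp + false_positive_bp), 2))  # -> 0.74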
For duplication calls, DELLY had the highest F-measure scores at all depths, followed closely by BreakDancer. The accuracies of CNVnator, Hydra, and Lumpy improved with increased depth between 5X and 30X, while only a slight decrease was observed for Hydra at 60X depth. Increased depth decreased the cnMOPS accuracy due to higher false positive rates. The performance of FusorSV was similar to that of the best-performing tools for each variant type at all depths.
The performance of cnMOPS and Hydra both depended on the size of the predicted duplications. The rate of false positive duplications from cnMOPS increased with higher sequencing depths and was biased towards smaller duplication sizes (Fig 2). All predicted duplications from Hydra were below 10 kbp, which led to a Jaccard score of 0.01 at all depths (S2 Table). At 60X depth, DELLY predicted a 2.32 Mbp false positive duplication, which resulted in a decreased Jaccard score (0.83).
BreakDancer, DELLY, and Lumpy performed well for the prediction of inversions, as the accuracy of each tool was at least 0.92 at all depths. The accuracy of Hydra was considerably lower at 5X and decreased with increasing depth. The precision, recall, and F1 score of FusorSV were similar to those of the best-performing tools for each variant type at all depths. However, lower Jaccard scores were observed for FusorSV (S3 Table) due to several multi-megabase-spanning false positives that were not predicted by other callers.
The performance of BreakDancer, Hydra, and DELLY depended on the size of the predicted inversions (Fig 3). Both BreakDancer and DELLY performed better for larger inversion sizes.

Prediction of known structural variants in C. elegans
BC4586 is a C. elegans strain containing experimentally validated structural variants [32]. Publicly available Illumina sequencing data allowed us to determine whether the short-read SV callers in the SVE/FusorSV pipeline could recover these known variants.

Agreement in predicted structural variants for wild C. elegans strains
To evaluate the usefulness of long-read DNA sequencing data for validating structural variants predicted from short-read technologies, we obtained data for 14 C. elegans wild strains with both Illumina and PacBio sequencing data. The predicted variants varied considerably between the callers (Table 4). The predicted deletions ranged from a median of 76 per strain in SVIM to 341 in cnMOPS. Conversely, cnMOPS predicted a median of only 5.5 duplications per strain compared to 129 in Sniffles. The median number of predicted inversions per strain ranged from 8 in Lumpy to 88 in BreakDancer.
Because the exact breakpoints of the same predicted variant may differ between callers, it is difficult to directly compare the agreement of calls generated by different tools. Therefore, predicted variants spanning protein-coding genes were used to compare caller agreement. For each comparison between callers, only predictions of a given SV type that spanned the exact same set of genes were considered to be in agreement with each other. Overlapping predictions between the long-read callers and FusorSV were compared to evaluate the use of long-read sequencing data for validating variants predicted from short-read data.
Among the total set of predicted deletions spanning genes, 190 were shared among all long-read callers. Of these deletions, 119 were predicted by FusorSV, and 64% of the genes spanned by deletions predicted by FusorSV were not shared by at least one long-read caller (Fig 4; S4 Table). Within the set of genes overlapping deletions predicted by SVIM, 95% were shared by at least one other caller. The other long-read tools had higher proportions of unique predictions not found in any other caller (Assemblytics = 84%, MUM&Co = 33%, pbsv = 28%, Sniffles = 45%, SVIM = 5%). The percentage of unique genes spanned by deletions also varied among the short-read callers (Fig 4; S5 Table), ranging from 3% in DELLY to 80% in cnMOPS and CNVnator.
Among the total set of predicted tandem duplications spanning genes, only 66 were shared among all long-read callers. Of these, only 52 were predicted by FusorSV (Fig 5; S6 Table), and 77% of the genes spanned by duplications predicted by FusorSV were not shared by at least one long-read caller. Within the set of genes overlapping duplications, MUM&Co contained the fewest unique predictions (11%), followed by SVIM (14%), Assemblytics (20%), pbsv (42%), and Sniffles (44%). The percentage of unique genes spanned by duplications also varied considerably among the short-read callers (S7 Table), ranging from 5% in DELLY to 90% in CNVnator.
Among the total set of predicted inversions spanning genes, 59 were shared among all long-read callers. Of these inversions, seven were predicted by FusorSV (Fig 6; S8 Table), and 89% of the genes spanned by inversions predicted by FusorSV were not shared by at least one long-read caller. Within the set of genes overlapping inversions, SVIM contained the fewest unique predictions (41%), followed by pbsv (51%), MUM&Co (60%), and Sniffles (69%). The percentage of unique genes spanned by inversions also varied considerably among the short-read callers (S9 Table), ranging from 17% in Lumpy to 47% in Hydra.

Discussion
Structural variant (SV) prediction is challenging for researchers studying non-human genomes. Most SV prediction tools were designed with the human genome in mind, and benchmarks on other species are lacking [13,15,16,19]. Without an adequate guide for tool selection, the accuracy of predicted SVs has a high degree of uncertainty. To benchmark accuracy, we used simulated data to evaluate six short-read structural variant callers included in SVE [15], a pipeline developed to be used with FusorSV, an ensemble learning method that leverages the strengths of each individual caller. We further used real short- and long-read data to demonstrate the concordance or discordance between callers. We acknowledge that our selection of prediction tools is not comprehensive, a near-impossible goal, and that other software was released while this study was underway.
The results for the simulated short-read data suggest that deletions and duplications may be predicted with high confidence using BreakDancer and DELLY, and accurate inversion predictions may be obtained using BreakDancer, DELLY, and Lumpy. The FusorSV performance typically reflected that of the best-performing individual tools, but FusorSV occasionally predicted large, megabase-spanning false positives. WGS data from the 1000 Genomes Project (1000GP) [33] may be used to benchmark SV prediction using real data. These data provide a high-confidence human truth set, as SVs were validated using a combination of short- and long-read WGS data, Moleculo synthetic long-read sequencing, microarray SV detection, and targeted long-read sequencing. Our results differ from the values reported in the literature for benchmarks generated using human 1000GP data [15]. For example, the accuracy of deletion calls from 1000GP data was considerably lower for BreakDancer (F1-score = 0.47), DELLY (F1-score = 0.54), and FusorSV (F1-score = 0.62). Our results for duplications and inversions were substantially better than those reported for BreakDancer (duplication F1-score = 0.00, inversion F1-score = 0.08), DELLY (duplication F1-score = 0.01, inversion F1-score = 0.09), and FusorSV (duplication F1-score = 0.19, inversion F1-score = 0.45) based on 1000GP data. The Genome in a Bottle Consortium (GIAB) recently published a human SV benchmarking dataset containing 12,745 sequence-resolved insertions (7,281) and deletions (5,464) [34]. These calls were generated from 19 variant calling methods using data from short-read, long-read, linked-read, optical, and electronic genome mapping. It should be noted that tandem duplications were categorized as insertions in these data, which would decrease the apparent performance of callers that discriminate between these two variant types. These data were used to benchmark Parliament2, a consensus-based method that predicts SVs using support from multiple individual callers [16]. For the prediction of deletions from 35X depth short-read data, Parliament2 achieved an overall F1-score of 0.82; the set of tools used for consensus calling included DELLY (F1-score = 0.65), BreakDancer (F1-score = 0.59), Lumpy (F1-score = 0.50), and CNVnator (F1-score = 0.11). The performance of each of these tools was inferior to what we observed at 30X. Differences in genome properties, such as repeat content, could account in part for the improvements seen here, but the use of simulated data was likely a major factor. Real data with experimentally validated variants would be valuable for identifying the underlying causes of these differences.
It should be noted that SVIM generates a VCF file containing all candidate SV calls, including calls of low confidence. Each prediction includes a quality score between 0 and 100 that provides a confidence estimate. The authors recommend using a threshold between 10 and 15 for higher-depth datasets (e.g., >40X), or a threshold that generates the expected number of predictions. For datasets with lower sequencing depth, the analyst may therefore be limited to selecting an arbitrary cut-off when the expected number of SVs is unknown. The thresholds we used for 5X (minimum QUAL = 2), 15X (minimum QUAL = 5), and 30X depth (minimum QUAL = 10) were scaled in proportion to the decrease in depth relative to the high-depth setting specified by the SVIM authors. The performance for these cut-offs was similar to that of the optimal cut-off values calculated post hoc (S10-S12 Tables), with the exception of inversions predicted at 142X depth, where a higher threshold is recommended. Deletion benchmarks were previously described for pbsv and Sniffles using data from the Database of Genomic Variants [35] and NCBI dbVar [36] projects. The precision and recall were quantified for different numbers of reads supporting the deletions. Precision values up to 0.91 and 0.81 were reported for pbsv and Sniffles, respectively. Lower recall values were observed for pbsv (up to 0.45) and Sniffles (up to 0.26). By contrast, we observed lower precision but higher recall using simulated data.
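In practice, this filtering amounts to thresholding the QUAL column of the SVIM VCF. The sketch below is illustrative only; the file name and selected depth are hypothetical, and only the depth-to-threshold mapping comes from the text:

    # Hedged sketch: filter SVIM candidate calls by the depth-scaled
    # QUAL thresholds used above (5X -> 2, 15X -> 5, 30X -> 10).
    QUAL_BY_DEPTH = {5: 2, 15: 5, 30: 10}
    depth = 15  # hypothetical dataset depth

    with open("svim_variants.vcf") as fin, open("svim_filtered.vcf", "w") as fout:
        for line in fin:
            if line.startswith("#"):
                fout.write(line)
                continue
            qual = line.split("\t")[5]
            if qual != "." and float(qual) >= QUAL_BY_DEPTH[depth]:
                fout.write(line)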
Illumina and PacBio sequencing data from 14 natural C. elegans strains were analyzed to determine the concordance between predicted SVs among both short- and long-read callers. Low agreement was observed among all predictions generated using either Illumina or PacBio data. Furthermore, many SV calls unique to a single caller were observed for predictions made using either short or long reads. It is therefore difficult to ascertain the accuracy of structural variants described in the literature, as considerably different results may be generated by different tools. Nonetheless, the simulated data suggest that short-read tools such as DELLY, BreakDancer, and Lumpy are likely to provide more accurate SV calling across a range of depths than the other short-read tools included in these benchmarks. If training data are available, FusorSV may also be used, but large megabase-spanning inversions should be interpreted with caution, as several large false positives were predicted by this tool.
For SV prediction from long reads, the simulated data suggested that none of pbsv, Sniffles, or SVIM can be used with high confidence for the prediction of duplications or inversions at lower depths (e.g., 5X). Fewer generalizations can be made for higher depths. SVIM had the best performance for the prediction of deletions at 15X depth, while Sniffles was superior at 30X. Sniffles performed best for the prediction of duplications and inversions at 15X and 30X depth. At 60X depth, SVIM had the highest accuracy for deletions, but Sniffles and pbsv demonstrated superior performance for duplications and inversions, respectively. At 142X depth, the SVIM accuracy was considerably higher than that of pbsv and Sniffles for deletions and duplications, while the accuracy of pbsv was considerably higher for inversions. If precision is less of a concern than recall, Sniffles may be the preferable choice for predicting duplications from long-read data.
The concordance between long-read callers is pertinent if long-read data are to be used to validate candidate SVs called using short-read data. Although few predicted SVs were common to all long-read callers, a large majority of the predicted deletions and duplications from SVIM were supported by at least one other caller. Therefore, SVIM might provide a more conservative option for validating deletions and duplications predicted from short reads. Although inversions predicted by SVIM had the highest support among other callers, over one quarter of these calls were unique to SVIM. Therefore, long-read data may be less reliable for validating inversions predicted from short reads.
To assess the concordance between short- and long-read approaches, the agreement between FusorSV and the long-read tools was measured. For each variant type, over three quarters of the FusorSV predictions were not shared by any of the long-read tools. Because higher accuracy has been reported for long-read tools in the literature, it is likely that FusorSV generated many false positives. Low agreement was observed for many of the SVs predicted by the individual tools used to train FusorSV, despite the higher accuracy observed in the simulated data. This may reflect a limitation of using simulated data to train FusorSV, as simulated data may bias the FusorSV models towards callers that perform poorly on real data. The low agreement observed between FusorSV and the long-read approaches is consistent with results previously reported for CNV prediction in cattle [37]. In that study, the authors used CNVnator and Sniffles to predict deletions and duplications from Illumina and PacBio sequencing data. After filtering out probable false positives, only 18% of the CNVs predicted by CNVnator overlapped with those predicted by Sniffles. For CNVs spanning genes, 22% of the CNVnator calls overlapped with a Sniffles prediction.
The low agreement between different SV callers emphasizes the challenges involved in the selection of ideal tools for SV calling. Without high quality benchmarks in non-human species, tool choice is largely arbitrary, and the analyst will likely select a caller based on popularity. Despite the lack of benchmarks, researchers increasingly rely upon these tools for the characterization of SVs in non-human species, calling into question the reliability and reproducibility of past research findings. Improved benchmarks will provide an important resource for the research community.

Conclusions
It is challenging to choose the appropriate tool for the prediction of structural variants from DNA sequencing data. Dozens of callers have been developed for calling structural variants from short-read data, but few independent benchmarks are available. Compounding this problem is the lack of benchmarks for non-human genomes. Here, multiple short-read and long-read callers were compared using both real and simulated C. elegans data. The results using simulated data showed that the performance of a given tool often varied considerably according to variant type and sequencing depth, and that no single tool performed best in all situations. The predictions generated from real data showed low overlap among all callers, with many predictions unique to individual tools.
Because variants predicted from short-reads depend highly on the tool used, the analyst may choose to validate these SVs using long-read data. However, the lack of a consensus among long-read callers suggests that using a long-read caller to generate a "truth-set" warrants caution. Nonetheless, most of the deletions and duplications predicted by SVIM and MUM&Co, respectively, were shared by at least one other caller. These tools may provide a more conservative approach for validating SV calls using long-reads. The availability of reference datasets for which the "ground truth" is known would provide valuable resources for improving our understanding of the best approaches for SV prediction in non-human organisms. Future benchmarking projects would benefit from publicly available data from strains with precise deletions of various lengths generated using CRISPR-Cas9 methods, as well as further lab strains with SVs validated manually using long-read technologies and PCR.

Structural variation prediction
Structural Variation Engine (SVE) and FusorSV (v0.1.3-beta) [15] were used to predict structural variants (deletions, duplications, and inversions) from real and simulated short-read sequencing data. SVE is an SV calling pipeline that produces VCF files compatible with FusorSV. FusorSV uses an ensemble learning approach to call structural variants using a fusion model trained on individual callers. The six structural variant callers included in SVE that support non-human genomes were evaluated here: BreakDancer [38], cnMOPS [39], CNVnator [40], DELLY [41], Hydra [42], and Lumpy [22]. The default SVE and FusorSV parameter settings were used.
Custom Python (v3.7) and Bash scripts were used to select the final set of SV predictions for benchmarking based on the following criteria: minimum size >= 100 bp and VCF FILTER flag = "PASS".
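A minimal sketch of this filtering step is shown below. The file names are hypothetical, and the sketch assumes the caller writes an SVLEN or END field in the INFO column; real caller outputs vary and may need caller-specific handling:

    # Hedged sketch of the SV filtering criteria described above.
    def info_dict(info):
        # Parse "KEY=VALUE;..." INFO fields into a dictionary.
        return dict(kv.split("=", 1) for kv in info.split(";") if "=" in kv)

    def keep_record(line, min_size=100):
        cols = line.rstrip("\n").split("\t")
        pos, filt, info = int(cols[1]), cols[6], info_dict(cols[7])
        if filt != "PASS":
            return False
        if "SVLEN" in info:
            return abs(int(info["SVLEN"].split(",")[0])) >= min_size
        if "END" in info:
            return int(info["END"]) - pos >= min_size
        return False  # no usable size information

    with open("caller_output.vcf") as fin, open("filtered.vcf", "w") as fout:
        for line in fin:
            if line.startswith("#") or keep_record(line):
                fout.write(line)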

Simulated data
SVsim (https://github.com/mfranberg/svsim; v0.1.1) was used to create mock C. elegans genomes containing simulated structural variants (deletions, duplications, and inversions) based on the WormBase [48] (WBcel235; https://parasite.wormbase.org/Caenorhabditis_elegans_prjna13758/Info/Index/) reference assembly for the N2 strain. Because FusorSV uses the SV type and size as discriminating features to train the fusion model, the training dataset was designed to have 30 SVs of each type in each size/type bin. A total of 129 mock genomes with variable numbers of simulated deletions, duplications, and inversions, ranging from 100 bp to 280 kbp, were created. This design ensured that the total base pairs spanned by SVs in each mock genome did not exceed 1% of the reference genome and that each bin was populated with a total of 30 SVs across the entire set of mock genomes. To benchmark the individual callers, a separate testing dataset of 129 mock genomes was created with the same distribution of SV types and sizes. A sketch of this packing scheme is shown below.
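The following sketch illustrates the design constraint only: populate each (SV type, size bin) with 30 simulated SVs and pack them into mock genomes so that no genome carries more than 1% of the reference in SV base pairs. The bin edges below are hypothetical (FusorSV's actual bins differ), and SVsim performs the actual genome editing:

    import random

    GENOME_BP = 100_286_401  # approximate size of the C. elegans N2 assembly
    CAP = int(GENOME_BP * 0.01)  # 1% SV-span budget per mock genome
    BINS = [(100, 1_000), (1_000, 10_000),
            (10_000, 100_000), (100_000, 280_000)]  # hypothetical bin edges
    TYPES = ["DEL", "DUP", "INV"]
    SVS_PER_BIN = 30

    # Draw every required SV first, then pack them into genomes greedily.
    svs = [(t, random.randint(lo, hi)) for t in TYPES for lo, hi in BINS
           for _ in range(SVS_PER_BIN)]
    random.shuffle(svs)

    genomes, current, used = [], [], 0
    for svtype, size in svs:
        if used + size > CAP and current:  # start a new mock genome
            genomes.append(current)
            current, used = [], 0
        current.append((svtype, size))
        used += size
    if current:
        genomes.append(current)
    print(f"{len(genomes)} mock genomes needed for one dataset")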
Short-read DNA sequencing of the mock genomes was simulated with the randomreads.sh script included with BBTools (sourceforge.net/projects/bbmap; v38.79). Paired-end reads of 100bp were simulated at 5X, 15X, 30X, and 60X depths of coverage using the Illumina error model with default settings. SimLoRD [49] (v.1.0.4) was used to simulate PacBio sequencing data for the mock genomes. The SimLoRD PacBio sequencing runs were simulated at a depth of 5X, 15X, 30X, 60X, and 142X (the median depth of real PacBio data used to predict SVs using long-read tools).

Real data
Data from the Caenorhabditis elegans Natural Diversity Resource (CeNDR) [23] were used to predict SVs in 14 C. elegans isolates collected from the wild. SV prediction using the SVE/FusorSV pipeline was performed using the BAM files provided for the 20200815 CeNDR release. SimuSCoP [50] (v1.0) was used to generate the simulated DNA sequencing data that trained the FusorSV model. DNA sequencing data from the C. elegans N2 reference strain (SRA runs SRR3452263, SRR1013928, and SRR9719854) were used to create the sequencing profiles used by SimuSCoP. For each strain, FusorSV models trained on simulated data of similar sequencing depth and read length were used to predict variants.
BC4586, a C. elegans strain containing validated structural variants, was used to evaluate the ability of the SVE/FusorSV pipeline to predict experimentally validated structural variants. Simulated SimuSCoP DNA sequencing data were generated using an N2 profile (SRA run SRR14489487) and used to train the FusorSV model. SVs for BC4586 (SRA run SRR14489485) were predicted using the SVE/FusorSV pipeline and used to identify the presence of a deletion (9853675-9857585), a tandem duplication (9857585-9862397), and an inversion (9853123-9853675), all on chromosome IV.

Structural variation benchmarking and comparison
Variant calling resulted in multiple overlapping structural variants of the same type, which can lead to inflated performance metrics, as each prediction may be counted as a true positive when compared to the truth dataset. For overlapping SV predictions of the same type and caller, a single call was selected using the criteria described in S13 Table. When no discriminating information was available among overlapping calls, the final SV was selected randomly.
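The collapsing of overlapping same-type calls can be sketched as follows. The actual selection criteria (S13 Table) are not reproduced here; a quality score is used as a hypothetical stand-in, with random tie-breaking as described above:

    import random

    def collapse(calls):
        """calls: list of (chrom, start, end, qual) of one SV type and caller."""
        calls = sorted(calls, key=lambda c: (c[0], c[1]))
        clusters, cluster, chrom_end = [], [], 0
        for c in calls:
            # Extend the current cluster while calls overlap on one chromosome.
            if cluster and c[0] == cluster[-1][0] and c[1] <= chrom_end:
                cluster.append(c)
                chrom_end = max(chrom_end, c[2])
            else:
                if cluster:
                    clusters.append(cluster)
                cluster, chrom_end = [c], c[2]
        if cluster:
            clusters.append(cluster)
        # Keep one call per cluster: best quality, random tie-break.
        picked = []
        for cl in clusters:
            best = max(c[3] for c in cl)
            picked.append(random.choice([c for c in cl if c[3] == best]))
        return picked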
Benchmarking using simulated data. For simulated data, each predicted variant was classified as either a true positive (TP) or a false positive (FP) using the Bedtools intersect command. Predictions that overlapped (minimum reciprocal overlap = 0.5) with at least one simulated variant in the mock genome were classified as true positives, and calls that did not were classified as false positives. Simulated variants that were not predicted were classified as false negatives (FN). These classifications were used to calculate the following performance metrics.

Precision is the ratio of true positives to the total number of predicted variants:

    Precision = TP / (TP + FP)

Recall is the ratio of true positives to the total number of simulated variants:

    Recall = TP / (TP + FN)

The F1 score summarizes prediction accuracy as the harmonic mean of the precision and recall:

    F1 = 2 × precision × recall / (precision + recall)

Because the precision, recall, and F1 scores were calculated using binary classifications, they provide an incomplete picture. For each SV type, predictions meeting the minimum reciprocal overlap threshold are classified as true positives but may include calls with imprecise breakpoints or size estimates. Furthermore, no single agreed-upon threshold exists for reciprocal overlap, and the cutoff value is typically chosen arbitrarily. SVs may also be called as multiple separate events that span the true variant [15], none of which meet the minimum reciprocal overlap requirement. If identifying genomic regions containing structural variation is of greater importance to the analyst than the precision of individual calls, metrics calculated from presence or absence classifications may be unsuitable.
For example, a true positive is not penalized for predicting breakpoints that extend beyond the true variant region, nor for predicting breakpoints that fall within the true breakpoints. Therefore, the Jaccard index was calculated to measure the amount of base-pair overlap between the predicted and simulated variants.
The Jaccard index was calculated as the ratio of the number of base pairs in the intersection and the union of the predicted and simulated variants:

    Jaccard = |predicted variants ∩ simulated variants| / |predicted variants ∪ simulated variants|

Because the Jaccard score is calculated from the union and intersection of base pairs in the predicted and truth sets, this metric is biased towards the performance of larger variants.
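As an illustration, both the reciprocal-overlap classification and the base-pair Jaccard index can be computed from simple interval lists. The sketch below assumes per-chromosome (start, end) intervals and reimplements the overlap test for clarity; in practice, Bedtools performed these operations:

    def reciprocal_overlap(a, b):
        # Fraction of each interval covered by their intersection.
        inter = min(a[1], b[1]) - max(a[0], b[0])
        if inter <= 0:
            return 0.0
        return min(inter / (a[1] - a[0]), inter / (b[1] - b[0]))

    def precision_recall_f1(predicted, truth, threshold=0.5):
        # TP/FP/FN classification under the 0.5 reciprocal-overlap rule.
        tp = sum(any(reciprocal_overlap(p, t) >= threshold for t in truth)
                 for p in predicted)
        fp = len(predicted) - tp
        fn = sum(not any(reciprocal_overlap(p, t) >= threshold for p in predicted)
                 for t in truth)
        precision = tp / (tp + fp) if predicted else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        return precision, recall, f1

    def merge(intervals):
        # Merge overlapping intervals so the union is well defined.
        merged = []
        for s, e in sorted(intervals):
            if merged and s <= merged[-1][1]:
                merged[-1] = (merged[-1][0], max(merged[-1][1], e))
            else:
                merged.append((s, e))
        return merged

    def jaccard(predicted, truth):
        # Base-pair Jaccard index over merged predicted and truth intervals.
        predicted, truth = merge(predicted), merge(truth)
        inter = sum(max(0, min(e1, e2) - max(s1, s2))
                    for s1, e1 in predicted for s2, e2 in truth)
        union = (sum(e - s for s, e in predicted)
                 + sum(e - s for s, e in truth) - inter)
        return inter / union if union else 0.0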

Comparisons using real data
Due to the lack of real data containing known structural variants, the sets of genes spanned by SVs predicted by each caller were compared to evaluate the consistency between different tools. C. elegans genome annotations obtained from WormBase [48] (release WBPS14) were used to identify the genes spanned by SVs in the CeNDR data. The gene set was limited to protein-coding genes for which at least 50% of the gene was covered by an SV. SVs larger than 50 kbp were excluded because this is the maximum size considered reliable by Assemblytics. For each comparison between callers, only predictions of a given SV type that spanned the exact same set of genes were considered to be in agreement with each other, as sketched below. The agreement between callers was depicted using UpSet plots [51].
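The gene-span agreement test reduces each SV call to the set of protein-coding genes it covers at 50% or more, so two callers agree only when those gene sets match exactly. A minimal sketch, with hypothetical input structures:

    def genes_spanned(sv_start, sv_end, genes, min_frac=0.5):
        """genes: iterable of (gene_id, start, end) on the SV's chromosome."""
        spanned = set()
        for gene_id, gstart, gend in genes:
            overlap = max(0, min(sv_end, gend) - max(sv_start, gstart))
            if overlap / (gend - gstart) >= min_frac:
                spanned.add(gene_id)
        return frozenset(spanned)

    def agreement(calls_a, calls_b, genes):
        """Gene sets on which two callers agree for one SV type."""
        sets_a = {genes_spanned(s, e, genes) for s, e in calls_a} - {frozenset()}
        sets_b = {genes_spanned(s, e, genes) for s, e in calls_b} - {frozenset()}
        return sets_a & sets_b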